Spatial Analysis of Travel Distances Across Different Modes Using Geographically Weighted Regression (GWR)

Goal

The goal is to understand how travel behavior (e.g., distances traveled for different modes of transportation) varies across different counties through a spatial regression analysis.

Data Sources

This vignette uses a database of 26,095 sample households residing in California, containing detailed information on their travel behavior on one assigned day for each household, from April 19, 2016 through April 25, 2017, provided by the Transportation Secure Data Center (TSDC). [1] Specifically, we obtained travel mode data from PersonData.Rds, HHData.Rdsand geographical data, i.e. the physical location and shape of each county, fromcounties.shp, files given in the dataset.

[1]“Transportation Secure Data Center.” (2019). National Renewable Energy Laboratory. Accessed Jan. 15, 2019: www.nrel.gov/tsdc.

Methodology

Geographically Weighted Regression (GWR):

Geographically Weighted Regression (GWR) is a spatial analysis technique that extends traditional regression by allowing the relationships between dependent and independent variables to vary spatially. Unlike ordinary least squares (OLS) regression, which assumes global stationarity of the coefficients, GWR incorporates geographic context into the model. This approach accounts for spatial heterogeneity, a common characteristic in spatial datasets, where relationships can change over space due to localized factors.

In GWR, the regression is performed repeatedly for each location in the dataset, weighting observations according to their spatial proximity to the focal location. The weighting is determined using a kernel function, which can be fixed or adaptive, depending on the data’s spatial distribution.[2]

Bandwidth Selection:

The selection of an appropriate bandwidth is a crucial step for GWR model. Bandwidth is a parameter that governs the spatial extent, over which neighboring observations influence the estimation of local parameters. The bandwidth serves as a key filter determining the degree of localization in the analysis.

A bandwidth that is too narrow may lead to oversensitivity to local variations, potentially capturing noise in the data. On the other hand, too broad bandwidths can result in over smoothed representations , masking subtle spatial patterns. With a proper bandwidth value, we are able to achieve the balance to ensure the GWR model accurately captures the true spatial heterogeneity without being unduly influenced by distant observations.

Adaptive bandwidths offer an effective solution, as they can vary based on the size of each geographical area and that of its neighbors. Thus, the model can select a narrower bandwidth in dense areas, and a larger one for suburban areas. [3]

In our analysis, we determined the optimal bandwidth using the `bw.gwr` function, which is 35.

[2] Charlton, M., & Fotheringham, A. S. (2009). Geographically weighted regression. [White Paper].

[3] Kiani et al.(2024, February 29). Mastering geographically weighted regression: Key considerations for building a robust model: Geospatial Health. Mastering geographically weighted regression: key considerations for building a robust model | Geospatial Health. https://www.geospatialhealth.net/gh/article/view/1271/1365

Data Pre-Processing

Merging data

Group the data by county (CTFIP), and calculates the average total number of miles traveled by a person for each travel mode (Drive Alone, Drive with Others, Passenger, Walk, and Total) on the survey day.

Visualizing County Data on a Map

shapefile_sum <- shapefile %>%
  left_join(summarized_data, by = "CTFIP")

mapviewOptions(fgb = F)

mapview(shapefile_sum, 
        zcol = "avg_Sum_Pmt", # assigned color based on sum distance
        legend = TRUE,
        label = as.character(shapefile_sum$CTFIP),  
        popup = popupTable(shapefile_sum, 
                           zcol = c("avg_DriveAlone_Dist", 
                                    "avg_Driveothers_Dist", 
                                    "avg_Passenger_Dist", 
                                    "avg_Walk_Dist", 
                                    "avg_Bike_Dist",
                                    "avg_Sum_Pmt")))

Perform Geographically Weighted Regression

Select GWR Bandwidth

Use bw.gwr() to find the optimal bandwidth for GWR analysis.

coords <- st_coordinates(st_centroid(st_geometry(shapefile_sum)))

gwr_data <- shapefile_sum %>%
  select(avg_DriveAlone_Dist, avg_Driveothers_Dist, avg_Passenger_Dist, avg_Walk_Dist, avg_Bike_Dist, avg_Sum_Pmt) %>%
  st_drop_geometry()

gwr_data <- cbind(gwr_data, coords)

gwr_formula <- avg_Sum_Pmt ~ avg_DriveAlone_Dist + avg_Driveothers_Dist + avg_Passenger_Dist + avg_Walk_Dist + avg_Bike_Dist

# convert to spatial data
coordinates(gwr_data) <- ~X + Y
proj4string(gwr_data) <- CRS("+proj=longlat +datum=WGS84")

# Perform bandwidth selection for GWR
bw <- bw.gwr(
  formula = gwr_formula,
  data = gwr_data,  
  adaptive = T)
## Adaptive bandwidth: 43 CV score: 36.66803 
## Adaptive bandwidth: 35 CV score: 36.27212 
## Adaptive bandwidth: 28 CV score: 40.56964 
## Adaptive bandwidth: 37 CV score: 37.88228 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 34 CV score: 36.77004 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 32 CV score: 38.86713 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 31 CV score: 40.03792 
## Adaptive bandwidth: 30 CV score: 40.67171 
## Adaptive bandwidth: 30 CV score: 40.67171 
## Adaptive bandwidth: 29 CV score: 41.34116 
## Adaptive bandwidth: 29 CV score: 41.34116 
## Adaptive bandwidth: 28 CV score: 40.56964 
## Adaptive bandwidth: 28 CV score: 40.56964 
## Adaptive bandwidth: 27 CV score: 40.22093 
## Adaptive bandwidth: 27 CV score: 40.22093 
## Adaptive bandwidth: 26 CV score: 40.56296 
## Adaptive bandwidth: 26 CV score: 40.56296 
## Adaptive bandwidth: 25 CV score: 40.20452 
## Adaptive bandwidth: 25 CV score: 40.20452 
## Adaptive bandwidth: 24 CV score: 39.88842 
## Adaptive bandwidth: 24 CV score: 39.88842 
## Adaptive bandwidth: 23 CV score: 40.12316 
## Adaptive bandwidth: 23 CV score: 40.12316 
## Adaptive bandwidth: 22 CV score: 40.45429 
## Adaptive bandwidth: 22 CV score: 40.45429

Results

gwr_model <- gwr.basic(
  formula = gwr_formula,
  data = gwr_data,
  bw = bw, 
  adaptive = T)

gwr_result <- gwr_model$SDF
summary(gwr_result)
## Object of class SpatialPointsDataFrame
## Coordinates:
##          min        max
## X -123.89432 -115.36552
## Y   33.03547   41.74308
## Is projected: FALSE 
## proj4string : [+proj=longlat +datum=WGS84 +no_defs]
## Number of points: 58
## Data attributes:
##    Intercept       avg_DriveAlone_Dist avg_Driveothers_Dist avg_Passenger_Dist
##  Min.   :-0.4934   Min.   :0.9145      Min.   :0.8048       Min.   :0.8516    
##  1st Qu.: 0.4961   1st Qu.:0.9518      1st Qu.:0.9750       1st Qu.:0.8755    
##  Median : 0.8308   Median :0.9979      Median :1.0536       Median :0.9178    
##  Mean   : 0.9485   Mean   :1.0118      Mean   :1.0237       Mean   :0.9157    
##  3rd Qu.: 1.4073   3rd Qu.:1.0629      3rd Qu.:1.0721       3rd Qu.:0.9444    
##  Max.   : 2.5491   Max.   :1.1466      Max.   :1.1642       Max.   :1.0113    
##  avg_Walk_Dist     avg_Bike_Dist         y              yhat      
##  Min.   :-0.9423   Min.   :1.616   Min.   :18.13   Min.   :19.12  
##  1st Qu.: 1.6040   1st Qu.:2.339   1st Qu.:25.74   1st Qu.:25.73  
##  Median : 4.0746   Median :2.797   Median :28.15   Median :28.10  
##  Mean   : 3.5652   Mean   :2.692   Mean   :28.95   Mean   :28.98  
##  3rd Qu.: 5.6299   3rd Qu.:3.087   3rd Qu.:31.11   3rd Qu.:31.31  
##  Max.   : 6.9179   Max.   :3.916   Max.   :40.88   Max.   :40.47  
##     residual            CV_Score Stud_residual       Intercept_SE   
##  Min.   :-1.035504   Min.   :0   Min.   :-2.66169   Min.   :0.6972  
##  1st Qu.:-0.321663   1st Qu.:0   1st Qu.:-0.75237   1st Qu.:0.8132  
##  Median : 0.005585   Median :0   Median : 0.01299   Median :0.9091  
##  Mean   :-0.029618   Mean   :0   Mean   :-0.11165   Mean   :0.9201  
##  3rd Qu.: 0.244794   3rd Qu.:0   3rd Qu.: 0.49366   3rd Qu.:1.0281  
##  Max.   : 1.141008   Max.   :0   Max.   : 3.06768   Max.   :1.2451  
##  avg_DriveAlone_Dist_SE avg_Driveothers_Dist_SE avg_Passenger_Dist_SE
##  Min.   :0.04545        Min.   :0.07797         Min.   :0.07302      
##  1st Qu.:0.04917        1st Qu.:0.08661         1st Qu.:0.08168      
##  Median :0.05427        Median :0.09965         Median :0.09066      
##  Mean   :0.05475        Mean   :0.09926         Mean   :0.09005      
##  3rd Qu.:0.05991        3rd Qu.:0.10985         3rd Qu.:0.09689      
##  Max.   :0.07354        Max.   :0.12553         Max.   :0.11771      
##  avg_Walk_Dist_SE avg_Bike_Dist_SE  Intercept_TV     avg_DriveAlone_Dist_TV
##  Min.   :0.8861   Min.   :1.050    Min.   :-0.6075   Min.   :12.72         
##  1st Qu.:1.0182   1st Qu.:1.122    1st Qu.: 0.5059   1st Qu.:17.83         
##  Median :1.1827   Median :1.170    Median : 0.9482   Median :19.25         
##  Mean   :1.4982   Mean   :1.234    Mean   : 0.9975   Mean   :18.72         
##  3rd Qu.:1.5748   3rd Qu.:1.337    3rd Qu.: 1.6638   3rd Qu.:20.28         
##  Max.   :3.3180   Max.   :1.605    Max.   : 2.6042   Max.   :21.74         
##  avg_Driveothers_Dist_TV avg_Passenger_Dist_TV avg_Walk_Dist_TV 
##  Min.   : 6.411          Min.   : 7.903        Min.   :-0.6299  
##  1st Qu.: 9.098          1st Qu.: 9.168        1st Qu.: 1.0900  
##  Median :10.650          Median :10.414        Median : 2.0566  
##  Mean   :10.565          Mean   :10.306        Mean   : 2.9079  
##  3rd Qu.:12.364          3rd Qu.:11.350        3rd Qu.: 5.5485  
##  Max.   :13.466          Max.   :12.864        Max.   : 6.4995  
##  avg_Bike_Dist_TV    Local_R2     
##  Min.   :1.107    Min.   :0.9879  
##  1st Qu.:1.860    1st Qu.:0.9922  
##  Median :2.414    Median :0.9937  
##  Mean   :2.236    Mean   :0.9937  
##  3rd Qu.:2.723    3rd Qu.:0.9952  
##  Max.   :3.160    Max.   :0.9978
head(gwr_result)
## class       : SpatialPointsDataFrame 
## features    : 6 
## extent      : -122.8884, -118.8004, 35.38639, 38.52794  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 24
## names       :          Intercept, avg_DriveAlone_Dist, avg_Driveothers_Dist, avg_Passenger_Dist,    avg_Walk_Dist,    avg_Bike_Dist,                y,             yhat,           residual, CV_Score,     Stud_residual,      Intercept_SE, avg_DriveAlone_Dist_SE, avg_Driveothers_Dist_SE, avg_Passenger_Dist_SE, ... 
## min values  : -0.493407547943922,   0.931195113605572,    0.804772664961579,  0.882009855991374, 2.07589095658566, 1.93713278106003, 25.5660966027259, 26.1724569854574, -0.780451312708898,        0, -2.66168916076884, 0.812249140951841,     0.0454497445608676,      0.0779715804092984,    0.0830799886053281, ... 
## max values  :   2.15676300222343,    1.14663162009081,     1.07245452213885,    1.0113345223337, 6.82075707909126, 3.53063802804501, 38.0691114580212, 38.8495627707301,  0.231925827520939,        0, 0.451298627278192,   1.1574966922564,     0.0668625157233102,       0.125534232867506,     0.110703232342893, ...
gwr_sf <- st_as_sf(gwr_result)
ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_DriveAlone_Dist)) + 
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +  
  theme_minimal() +
  labs(title = "Spatial Variation of avg_DriveAlone_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Driveothers_Dist)) +  
  scale_color_viridis_c() +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Driveothers_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Passenger_Dist)) +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() + 
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Passenger_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Walk_Dist)) +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +  
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Walk_Dist Coefficient", 
       color = "Coefficient Value")

ggplot(data = gwr_sf) +
  geom_sf(aes(color = avg_Bike_Dist)) +  
  geom_sf(data = boundaries, fill = NA, color = "black", linewidth = 0.2) +
  scale_color_viridis_c() +  
  theme_minimal() +
  labs(title = "Spatial Variation of avg_Bike_Dist Coefficient", 
       color = "Coefficient Value")

Interprotation of the Coefficients

Walking

Coastal areas (especially below 38°N) have significantly higher coefficients (yellow to green), suggesting that walking distance contributes more significantly to changes in total traveling distance.

Driving Alone

Around 38°N, 123°W which is the Bay Area, the higher coefficients indicate that the average drive-alone distance contributes more significantly to changes in total traveling distance.

Driving with Others

Coefficients are higher in southern and northern regions (green regions), while some coastal areas near 38°N have lower coefficients (dark blue), which suggest a converse relationship to the drive-alone distance.

Riding as Passanger

The coefficients are generally low (mostly dark blue) across all regions, particularly in southern areas, suggesting that passenger distance is less influential.

Biking

Coefficients are lower in southern regions (dark purple regions), suggesting that bike distance is less influential.

Summary

Driving Alone: Higher influence in the northern regions.

Driving with Others: Higher influence in central and southern regions.

Walking: Stronger impacts in coastal regions.

Cycling: Stronger impacts in northern and central regions.

Passenger: Riding as a passenger is relatively consistent across regions.